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We have used Fortran 90 to implement lattice QCD. We have designed a set of 
machine independent modules that define fields (gauge, fermions, scalars, etc..) 
and overloaded operators for all possible operations between fields, matrices and 
\ numbers. With these modules it is very simple to write high-level efficient programs 

for QCD simulations. To increase performances our modules also implements as- 
signments that do not require temporaries, and a machine independent precision 
definition. We have also created a useful compression procedure for storing the 
\ lattice configurations, and a parallel implementation of the random generators. We 

have widely tested our program and modules on several parallel and single processor 
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, supercomputers obtaining excellent performances. 



^ ' 1 Introduction 



Simulating Quantum Chromo Dynamics (QCD) on the lattice is a challenging 
task for the computers of today. The simulations involve a huge number of 
variables and are so demanding on the computer resources that QCD simula- 
tion is already regarded as a benchmark test for the efficiency and performance 
of the modern supercomputers. With the twofold goal of facilitating the de- 
velopment of algorithms and applications for lattice QCD, and of maintaining 
good code performance, we have taken advantage of the possibilities offered 
by Fortran 90 to write a set of modules for a high-level, yet efficient imple- 
mentation of QCD simulations. Fortran 90 offers the possibility to define both 
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types and overloaded operators. These two key features make Fortran 90 par- 
ticularly suitable for writing QCD programs at a very high level where the 
algorithms are transparent and compact. On the other hand, by maximizing 
efficiency of the code in the definition of the overloaded operators, uniform 
and automatic efficiency is ensured in the entire simulation with little effort 
required from the high level programmer. 

Our end product is a package, "QCDF90" , which is fully described in a long 
documentation [1], where we provide all the information needed to use our 
package. In the following we will highlight the main characteristics of QCDF90. 



2 Precision 

To render, from the beginning, the precision definitions machine independent, 

the module "precision" defines two kind parameters, REALS and LONG. These 
parameters store the kind of an 8-byte floating point variable and of an 8-byte 
integer variable. 



3 Lattice geometry 

The set of all lattice sites can be subdivided into "even" and "odd" sites 
according to whether the sum of the integer valued coordinates x-|-y-|-z-|-tis 
even or odd (checkerboard subdivision). This division is actually quite useful 
in lattice algorithms, especially in the context of parallel implementations. In 
QCDF90 we implement the same checkerboarded separation of the lattice in 
two sublattices. Correspondingly all field variables are divided into even and 
odd variables. We encode this important division in the very first component 
of the type definitions for the fields. This component is an integer variable 
called parity which takes the values and 1 for variables defined over even 
and odd sites respectively. 

For the gauge variables and for the variables associated with the generators it 
is convenient to define a separate type for all four directions. This is realized 
with a second integer component variable dir which takes values from 1 to 4 
for variables defined in the corresponding direction. 

In a vectorized or superscalar architecture pipelined instructions and longer 

arrays give origin to better performance. Therefore the lattice is most ef- 
ficiently indexed by a single lattice volume index ranging xyzt from to 
NX * NY * NZ * NT/2 - 1 for each sublattices (where Ni is the lattice size 
in direction i). 
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4 Field definitions 



Here is a summary of all the types we have defined in the modules: gauge_f ield 
(a 3 * 3 complex matrix in a given direction); fermi_f ield (a 3 component 
complex vector times 4 spinor indices); complex_f ield (a complex scalar); 
real_f ield (a real scalar); generator_f ield (8 real variables in a given di- 
rection, for the SU{3) generators). All the above field are defined on an even 
or odd sublattice. 

In addition wc define the type f ull_gauge_f ield (a collection of 8 gauge_fields) 
to store the entire gauge field configuration (including both parities and all 
directions). Moreover we defined the type matrix as a 3 * 3 complex matrix. 



5 Overloaded operators 

To complete the high-level structure of the code one needs to define high level 
operators between the types described above. Wc define overloaded operators 
for all possible operations between fields, matrices, complex and real numbers. 
The overloaded operator set includes multiplication, multiplication with the 
adjoint, division, addition, subtraction, lambda matrices algebraic manipula- 
tions, gamma matrices algebraic manipulations, adjoin-ing, conjugation, real 
and complex traces, exponentiation, square root, contraction, etc... 

For example, if gi are gauge fields, the use of overloaded operators allows 
instructions to be as simple as: 

gi = g2 * g3, 

specifically, such an instruction means that, at each site of the hipercubical 
lattice, the 3x3 complex matrix multiplication between g2 and gs is performed. 

We have maximized code efficiency in the definitions of the overloaded opera- 
tors. In every case, the overloaded operation is implemented with the minimum 
number of elementary arithmetic operations (see the section on Assignments). 

Further operators have been overloaded to perform special operations involv- 
ing the SU{3) generator _fields and gauge_f ields. In particular it is very 
important to have an efficient algorithm for the matrix exponentiation, since 
this operation can be a time consuming component of several QCD calcu- 
lations. We use a new algorithm that takes advantage of the properties of 
the 3*3 Hermitian traceless matrices to perform the exponentiation with a 
minimal number of arithmetic operations. 
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6 Shifts and pairallel transport 



Shifts are also implemented as overloaded operators. For each field type C- 
shift implements an ordinary shift of the field with respect to the Cartesian 
geometry of the lattice. 

Moreover, because gauge theories are characterized by the property of local 
gauge invariance, we find it very useful to directly define a U-shift operator 
that consists of the shift with the appropriate parallel transport factor. In fact 
in a gauge theory the parallel transport (U-shift) is the relevant shift operator 
and is the natural building block for programming. 

The efficient manipulation of the Dirac operator is the most critical issue in 
QCD. For the Fermi fields we found it useful to define other shift operators 
which incorporate fundamental features of the Dirac operator. Firstly, one 
needs a shift operator that combines the naive shift with the parallel transport 
and the appropriate gamma matrix manipulation. This is implemented by 
the overloaded operators W-shift. Acting on a Fermi field f i, W-shift in the 
direction /i produces a Fermi field f 2, given by 

/2,x=(l-7/.)^x^/l,x+A (1) 

for positive /i, and 

/2,x = (l + 7M)t^^VM-A (2) 

for negative In these equations and gauge link variable 

necessary to implement the proper parallel transport. Combining shift with 
gamma matrix manipulation in the operator W-shift pays off by conserving 
computer resources, in fact, it follows from the properties of the gamma ma- 
trices that only one half of the spin components of the field /i undergo the 
transport at a time. The W-shift is the minimal speciahzed operator that 
achieves this transport. 

The complete lattice Dirac- Wilson operator has been also overloaded, as well 
as the adjoint operator Xdirac= 75 Dirac 75. 



7 Assignments 

The use of overloaded operators may imply the creation of more temporaries 
and, consequently, more motion of data than a straightforward implementation 
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of operations among arrays. Consider for example the following operation 
among variables of type fermiJield: fl = fl + f2 + f3. As far as we know, 
Fortran 90 does not specify how the variables should be passed in function 
calls. As a consequence, the above instruction may require as many as four 
temporaries depending on the operating system. (An operating system that 
implements overloaded operations via function calls would first add f 1 and 
f 2, placing the result in a temporary tl whose address would then be passed 
to the calling program. The compiler would then copy tl into a temporary t2, 
add f3 to t2 and place the result in tl. Finally tl would be copied into f 1.) 
The procedure could be drastically simplified and made system independent 
through the use of an overloaded assignment + =. The above instruction could 
be written fl + = f2 + f3 which the compiler would implement by issuing 
first a call to a function that adds f 2 and fS returning the result in tl. The 
addresses of f 1 and tl would then be passed to a subroutine that implements 
the operation fl = fl + tl among the components of the data types. The 
required number of copies to memory would be only two, instead of four and 
it does not depend on the operating system. 

In order to allow for these possible gains in efficiency, we have defined a large 
set of overloaded assignments. Since Fortran 90 permits only the use of the = 
symbol for the assignment, we have defined two global variables: a character 
variable assign_type and an integer variable assign_spec. The latter is intro- 
duced to accommodate assignments of a more elaborate nature. The default 
values of these variables are "=" and "0". Overloaded assignments are ob- 
tained by setting the assign_type (and, if necessary, the assign_spec) to the 
appropriate value immediately before the assignment. Our example then be- 
comes assign_type =' +'; f 1 = f 2 + f 3. The use of the variable assign_spec 
can be ilhistrated with another example. Consider the U-shift operation that 
implements shifts on gauge fields. There are in fact four U-shifts, correspond- 
ing to the four directions for the shift. To shift the gauge field b in the direction 
n and copy the result in the gauge field a we will use: 

assign_type =' u'; assign_spec = n; a = b 

Note that the assign_type and the assign_spec are automatically reset to 
their default values after each use. Therefore every call to an overload assign- 
ment operation must be preceded by an assignment statement. This ensure 
protection against misuse of the assignment. 

Assignments can be defined between variables of different types. We have 
collected these assignments in the module "assign_mixed" . Often these assign- 
ments have very useful but not necessarily obvious definitions. For example 
if Ci is a complex_field and complex is a complex variable, the instruction 
complex = Ci, is interpreted as setting the variable complex to the sum over 
all the lattice of the components of Ci. 
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References [1] provides a detailed explanation of all the assignment instruc- 
tions. 



8 Conditionals 

The module "conditional" defines six overloaded relational operators, >, >=, 
<, <=, / —, and the .Xor. operator. 

The relational operators perform two tasks: first they return a logical variable 
(true if parity and dir components of the operands are the same, false if 

they are undefined or not the same), second, and more important, they set 
the global variable context to .TRUE, at all sites where the relation is satisfied 
and to .FALSE, at all other sites. 

The operator .Xor. admits as operands a pair of fields of the same type and 
returns a field, also of the same type, having as elements the corresponding 
elements of the first operand at the sites where the global variable context is 
.TRUE, and the elements of the second operand at the sites where context is 



This can be used to select elements out of two fields according to some local 
condition, an operation which lies at the foundation of stochastic simulation 
techniques. 



9 Random numbers 

We have implemented a parallelizable version of the unix pseudorandom num- 
ber generator erand48, which also provides added functionahty. Erand48 is a 
congruential pseudorandom number generator based on the iterative formula 



where ai = 0x5DEECE66D, bi = OxB, Sj and Sj+i are integers of at least 48 bits 
of precision. The "seeds" Sj are converted to real pseudorandom numbers 
with uniform distribution between and 1 by = 2^^^ Si. 

As presented above, the algorithm is intrinsically serial. However it follows 
from Eq. (3) that the N*'^ iterate s^+at is still of the form 



.FALSE.. 



Sj+i — ai * Si + bi mo' 




(3) 



Sj+jv — ajsr * Si + b]\f mod 2 



(4) 
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with integers ajv and b]\[ which are uniquely determined by ai, hi. The module 
takes advantage of this fact and of the definitions of a global variable seeds 
to generate pseudorandom numbers in a parallelizable fashion. The module 
"random_numbers" overloads operators that generate Gaussian or uniformly 
distributed fields and do all the necessary seed manipulations. 



10 Write and read configurations 

To store and retrieve an entire SU(3) gauge field configuration, we have devel- 
oped a portable, compressed ASCII format. Only the first two columns of the 
gauge field matrices are stored, thanks to unitarity and unimodularity. Our 
subroutines takes advantage of the fact that all of the elements of the gauge 
field matrices have magnitude smaller or equal to 1 to re-express their real and 
imaginary parts as 48bit integers. These integers are then written in base 64, 
with the digits being given by the ASCII collating sequence starting from the 
character "0" (to avoid unwanted ASCII characters). Thus, an entire gauge 
field matrix is represented by 96 ASCII characters, without loss of numerical 
information. 

To ensure the recover of the data, a detailed description of the compressing 
procedure is written, as a header, at the beginning of each stored configuration 
file itself. 



11 Performances 

The code has been designed to run in any computer. It has been tested on 
a SGI Power Challenge Array with 90 MHz processor nodes, using IRIX 6.1 
and IRIX 6.2 and Fortran 90, with a single processor -03 optimization flags 
or with the flags -03 -pfa -mp to implement multiprocessing; on a Silicon 
Graphics Indigo using the IRIX 6.1 Fortran 90; and on the IBM R6000 58II 
model 7013 at 55 MHz, running AIX with the xlf90 IBM compiler using the 
-03 optimization flags. 

Memory required to execute varies according to the applications. Scales pro- 
portionally to the lattice volume NX * NY * NZ * NT. On a 16^ lattice, 
the example codes included in QCDF90, quenched.f90 and propagator. f 90 use 
approximately 110 Mbytes and 140 Mbytes respectively. 

The run of the example programs quenched. f90 and propagator.f90 take ap- 
proximately 45 microsec to update an SU(3) link, and 8 microsec to calculate 
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a plaquette, and 20 microsec for a CG step per link, using a 16^ lattice, on an 
SGI Power-Challenge per node. 
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